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Abstract 

We present a framework for the design of optimal assembly algorithms for shotgun sequencing under the criterion of 
complete reconstruction. We derive a lower bound on the read length and the coverage depth required for 
reconstruction in terms of the repeat statistics of the genome. Building on earlier works, we design a de Brujin graph 
based assembly algorithm which can achieve very close to the lower bound for repeat statistics of a wide range of 
sequenced genomes, including the GAGE datasets. The results are based on a set of necessary and sufficient conditions 
on the DNA sequence and the reads for reconstruction. The conditions can be viewed as the shotgun sequencing 
analogue of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by Hybridization. 



Introduction 

Problem statement 

DNA sequencing is the basic workhorse of modern day 
biology and medicine. Since the sequencing of the Human 
Reference Genome ten years ago, there has been an explo- 
sive advance in sequencing technology, resulting in several 
orders of magnitude increase in throughput and decrease 
in cost. Multiple "next-generation" sequencing platforms 
have emerged. All of them are based on the whole-genome 
shotgun sequencing method, which entails two steps. First, 
many short reads are extracted from random locations on 
the DNA sequence, with the length, number, and error 
rates of the reads depending on the particular sequencing 
platform. Second, the reads are assembled to reconstruct 
the original DNA sequence. 

Assembly of the reads is a major algorithmic challenge, 
and over the years dozens of assembly algorithms have 
been proposed to solve this problem [1]. Nevertheless, 
the assembly problem is far from solved, and it is not 
clear how to compare algorithms nor where improve- 
ment might be possible. The difficulty of comparing algo- 
rithms is evidenced by the recent assembly evaluations 
Assemblathon 1 [2] and GAGE [3], where which assem- 
bler is "best" depends on the particular dataset as well as 
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the performance metric used. In part this is a conse- 
quence of metrics for partial assemblies: there is an 
inherent tradeoff between larger contiguous fragments 
(contigs) and fewer mistakes in merging contigs (mis- 
joins). But more fundamentally, independent of the 
metric, performance depends critically on the dataset, i.e. 
length, number, and quality of the reads, as well as the 
complexity of the genome sequence. 

With an eye towards the near future, we seek to under- 
stand the interplay between these factors by using the 
intuitive and unambiguous metric of complete reconstruc- 
tion. The notion of complete reconstruction can be 
thought of as a mathematical idealization of the notion of 
"finishing" a sequencing project as defined by the National 
Human Genome Research Institute [4], where finishing a 
chromosome requires at least 95% of the chromosome to 
be represented by a contiguous sequence. Note that this 
objective of reconstructing the original DNA sequence 
from the reads contrasts with the many optimization- 
based formulations of assembly, such as shortest common 
superstring (SCS) [5], maximum-likelihood [6], [7], and 
various graph-based formulations [8], [9]. When solving 
one of these alternative formulations, there is no guarantee 
that the optimal solution is indeed the original sequence. 

Given the goal of complete reconstruction, the most basic 
questions are 1) feasibility: given a set of reads, is it possible 
to reconstruct the original sequence? 2) optimality: which 
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algorithms can successfully reconstruct whenever it is feasi- 
ble to reconstruct? The feasibility question is a measure of 
the intrinsic information each read provides about the 
DNA sequence, and for given sequence statistics depends 
on characteristics of the sequencing technology such as 
read length and noise statistics. As such, it can provide an 
algorithm-independent basis for evaluating the efficiency of 
a sequencing technology. Equally important, algorithms 
can be evaluated on their relative read length and data 
requirements, and compared against the fundamental limit. 

In studying these questions, we consider the most 
basic shotgun sequencing model where N noiseless 
reads (i.e. exact subsequences) of a fixed length L base 
pairs are uniformly and independently drawn from a 
DNA sequence of length G. In this statistical model, fea- 
sibility is rephrased as the question of whether, for given 
sequence statistics, the correct sequence can be recon- 
structed with probability 1 - L when N reads of length L 
are sampled from the genome. We note that answering 
the feasibility question of whether each N, L pair is suf- 
ficient to reconstruct is equivalent to finding the mini- 
mum required N (or the coverage depth c = NLIG) as a 
function of L. 

A lower bound on the minimum coverage depth 
needed was obtained by Lander and Waterman [10]. 
Their lower bound c LW = c LW (L, L) is the minimum 
number of randomly located reads needed to cover the 
entire DNA sequence with a given target success prob- 
ability 1 - L . While this is clearly a necessary condition, 
it is in general not tight: only requiring the reads to 
cover the entire genome sequence does not guarantee 
that consecutive reads can actually be stitched back 
together to recover the original sequence. Characterizing 
when the reads can be reliably stitched together, i.e. 
determining feasibility, is an open problem. In fact, the 
ability to reconstruct depends crucially on the repeat 
statistics of the DNA sequence. 

An earlier work [11] has answered the feasibility and 
optimality questions under an i.i.d. model for the DNA 
sequence. However, real DNA, especially those of eukar- 
yotes, have much longer and complex repeat structures. 
Here, we are interested in determining feasibility and 
optimality given arbitrary repeat statistics. This allows 
us to evaluate algorithms on statistics from already 
sequenced genomes, and gives confidence in predicting 
whether the algorithms will be useful for an unseen gen- 
ome with similar statistics. 

Results 

Our approach results in a pipeline, which takes as input a 
genome sequence and desired success probability 1 - L, 
computes a few simple repeat statistics, and from these 
statistics computes a feasibility plot that indicates for 
which L, N reconstruction is possible. Figure 1 displays 



the simplest of the statistics, the number of repeats as a 
function of the repeat length t. Figure 2 shows the result- 
ing feasibility plot produced for the statistics of human 
chromosome 19 (henceforth he 19) with success probabil- 
ity 99%. The horizontal axis signifies read length L and 
the vertical axis signifies the normalized coverage depth 
c := c/clw> the coverage depth c normalized by c LW , the 
coverage depth required as per Lander- Waterman [10] in 
order to cover the sequence. 

Since the coverage depth must satisfy c > c LW , the nor- 
malized coverage depth satisfies c > 1 , and we plot the 
horizontal line c = 1 ♦ This lower bound holds for any 
assembly algorithm. In addition, there is another lower 
bound, shown as the thick black nearly vertical line in 
Figure 2. In contrast to the coverage lower bound, this 
lower bound is a function of the repeat statistics. It has a 
vertical asymptote at L crit := max{^ interleaved , triple) + h 
where interleaved * s the length of the longest interleaved 
repeats in the DNA sequence and triple is the length of 
the longest triple repeat (see Section for precise defini- 
tions). Our lower bound can be viewed as a generaliza- 
tion of a result of Ukkonen [12] for Sequencing by 
Hybridization to the shotgun sequencing setting. 

Each colored curve in the feasibility plot is the lower 
boundary of the set of feasible N,L pairs for a specific 
algorithm. The rightmost curve is the one achieved by 
the greedy algorithm, which merges reads with largest 
overlaps first (used for example in TIGR [13], CAP3 [14], 
and more recently SSAKE [15]). As seen in Figure 2, its 
performance curve asymptotes at L = £ repeSLV the length of 
the longest repeat. De Brujin graph based algorithms (e.g. 
[16] and [8]) take a more global view via the construction 
of a de Brujin graph out of all the K-mers of the reads. 
The performance curves of all K-mer graph based algo- 
rithms asymptote at read length L = L crit , but different 
algorithms use read information in a variety of ways to 
resolve repeats in the K-mer graph and thus have differ- 
ent coverage depth requirement beyond read length L crit . 
By combining the ideas from several existing algorithms 
(including [8], [17]) we designed MULTIBRIDGING, 
which is very close to the lower bound for this dataset. 
Thus Figure 2 answers, up to a very small gap, the feasi- 
bility of assembly for the repeat statistics of he 19, where 
successful reconstruction is desired with probability 99%. 

We produce similar plots for a dozen or so datasets 
(Additional file 1). For datasets where interleaved is signif- 
icantly larger than £ tr [ P \ e (the majority of the datasets we 
looked at, including those used in the recent GAGE 
assembly algorithm evaluation [3]), MULTIBRIDGING 
is near optimal, thus allowing us to characterize the 
fundamental limits for these repeat statistics (Figure 9). 
On the other hand, if ^triple is close to or larger than 
interleaved; there is a gap between the performance of 
MULTIBRIDGING and the lower bound (see for example 
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Figure 1 For hc19, a log plot of number of repeats as a function of the repeat length £. Red line is what would have been predicted by 
an i.i.d. fit. 



For the he 19 dataset, the critical window size evaluates 
to about 19% of L crit . 

In Sections and, we discuss the underlying analysis and 
algorithm design supporting the plots. The curves are all 
computed from formulas, which are validated by simula- 
tions. We return in Section to put our contributions in a 
broader perspective and discuss extensions to the basic 
framework. All proofs can be found in the appendix. 

Lower bounds 

In this section we discuss lower bounds, due to coverage 
analysis and certain repeat patterns, on the required cov- 
erage depth and read length. The style of analysis here is 
continued in Section, in which we search for an assembly 
algorithm that performs close to the lower bounds. 

Coverage bound 

Lander and Waterman's coverage analysis [10] gives the 
well known condition for the number of reads N LW 



Figure 3). The reason for the gap is explained later in the 
paper. 

An interesting feature of the feasibility plots is that for 
typical repeat statistics exhibited by DNA data, the 
minimum coverage depth is characterized by a critical 
phenomenon: If the read length L is below L crit = £[ nter . 
leaved* reliable reconstruction of the DNA sequence is 
impossible no matter what the coverage depth is, but if 
the read length L is slightly above L crit , then covering 
the sequence suffices, i.e. c = c/clw = 1. The sharpness 
of the critical phenomenon is described by the size of 
the critical window, which refers to the range of L over 
which the transition from one regime to the other 
occurs. For the case when MULTIBRIDGING is near 
optimal, the width W of the window size can be well 
approximated as: 

w% _^_, wherer:= ^A, a) 

2r + 1 logs -1 
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Figure 2 Thick black lines are lower bounds on feasibility which holds for all algorithms, and colored curves are performance 
achieved by specific algorithms. Four such curves are shown: the greedy algorithm and three de Brujin graph based algorithms. 



required to cover the entire DNA sequence with prob- 
ability at least 1 - L. In the regime when L « G, one 
may make the standard assumption that the starting 
locations of the N reads follow a Poisson process with 
rate X = N/G, and the number N LW is to a very good 
approximation given by the solution to the equation 



G N LW 
N LW = - log . 

L 8 



(2) 



The corresponding coverage depth is c LW = N LW L/G. 
This is our baseline coverage depth against which to 
compare the coverage depth of various algorithms. For 
each algorithm, we will plot 



N 



c := 



cm N : 



LW 



the coverage depth required by that algorithm normal- 
ized by c LW . Note that c is also the ratio of the number 
of reads N required by an algorithm to N LW - The require- 
ment c > 1 is due to the lower bound on the number of 
reads obtained by the Lander-Waterman coverage 
condition. 

Ukkonen's condition 

A second constraint on reads arises from repeats. A 
lower bound on the read length L follows from Ukko- 
nen's condition [12]: if there are interleaved repeats or 
triple repeats in the sequence of length at least L - 1, 
then the likelihood of observing the reads is the same 
for more than one possible DNA sequence and hence 
correct reconstruction is not possible. Figure 4 shows an 
example with interleaved repeats. (Note that we assume 
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Figure 3 Performance of MULTIBRIDGING on P Marinus, where € trip i e > interleaved 



1-L >l/2, so random guessing between equally likely 
sequences is not viable.) 

We take a moment to carefully define the various types 
of repeats. Let denote the length-^ subsequence of the 
DNA sequence s starting at position t. A repeat of length 
t is a subsequence appearing twice, at some positions t y t 2 
(so = sf 2 ) that is maximal (i.e. s(ti - 1) * s(t 2 - 1) and 5 
(ti + £) * s(t 2 + £)). Similarly, a triple repeat of length i is 
a subsequence appearing three times, at positions ty t^ t 3 , 
such that s l h = sf 2 = sf 3 , and such that neither of s(t x - 1) = 
s(t 2 - 1) = s(t 3 - 1) nor s{t x + i) = s{t 2 + i) = s{t 3 + Z) holds. 
(Note that a subsequence that is repeated /times gives rise 
to (^) repeats and triple repeats.) A copy is a single 
one of the instances of the subsequence's appearances. A 
pair of repeats refers to two repeats, each having two 
copies. A pair of repeats, one at positions t h t 3 with t\ < t 3 
and the second at positions t^ t 4 with t 2 < t 4 , is interleaved 
if ti < t 2 < t 3 < t 4 or t 2 < ti < £ 4 < t 3 (Figure 4). The length 



of a pair of interleaved repeats is defined to be the length 
of the shorter of the two repeats. 

Ukkonen's condition implies a lower bound on the 
read length, 

L > ^crit - = max{€ interleaved / -^triple} + !• 

Here interleaved is the length of the longest pair of 
interleaved repeats on the DNA sequence and £ t ri P ie is 
the length of the longest triple repeat. 

Ukkonen's condition says that for read lengths less 
than L crit , reconstruction is impossible no matter what 
the coverage depth is. But it can be generalized to pro- 
vide a lower bound on the coverage depth for read 
lengths greater than L crit , through the important concept 
of bridging as shown in Figure 5. We observe that in 
Ukkonens interleaved or triple repeats, the actual length 
of the repeated subsequences is irrelevant; rather, to 
cause confusion it is enough that all the copies of the 
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pertinent repeats are unbridged. This leads to the fol- 
lowing theorem. 

Theorem 1. Given a DNA sequence s and a set of reads, 
if there is a pair of interleaved repeats or a triple repeat 
whose copies are all unbridged, then there is another 
sequence s' of the same length under which the likelihood 
of observing the reads is the same. 

For brevity, we will call a repeat or a triple repeat 
bridged if at least one copy of the repeat is bridged, and a 
pair of interleaved repeats bridged if at least one of the 
repeats is bridged. Thus, the above theorem says that a 
necessary condition for reconstruction is that all inter- 
leaved and triple repeats are bridged. 

How does Theorem 1 imply a lower bound on the cov- 
erage depth? Focus on the longest pair of interleaved 
repeats and suppose the read length L is between the 
lengths of the shorter and the longer repeats. The prob- 
ability this pair is unbridged is f/?^ nbnd§ed ) 2 , where 

vr ^interleaved ' 

^unbridged ;= pj£ _ i en gth subseq. is unbridged] 

N (3) 
= et> 

Theorem 1 implies that the probability of making an 
error in the reconstruction is at least 1/2 if this event 
occurs. Hence, the requirement that P e rror ^ L implies a 
lower bound on the number of reads N: 



N > 



{L - interleaved - 1) ln(l/(2fi)) ' 



(4) 



A similar lower bound can be derived using the long- 
est triple repeat. A slightly tighter lower bound can be 
obtained by taking into consideration the bridging of all 
the interleaved and triple repeats, not only the longest 
one, resulting in the black curve in Figure 2. 

Towards optimal assembly 

We now begin our search for algorithms performing close 
to the lower bounds derived in the previous section. Algo- 
rithm assessment begins with obtaining deterministic suf- 
ficient conditions for success in terms of repeat-bridging. 
We then find the necessary N and L in order to satisfy 
these sufficient conditions with a target probability 1 - L . 
The required coverage depth for each algorithm depends 
only on certain repeat statistics extracted from the DNA 
data, which may be thought of as sufficient statistics. 

Greedy algorithm 

The greedy algorithm, denoted GREEDY, with pseudo- 
code in the supplementary material, is described as fol- 
lows. Starting with the initial set of reads, the two 
fragments (i.e. subsequences) with maximum length 
overlap are merged, and this operation is repeated until a 
single fragment remains. Here the overlap of two frag- 
ments x, y is a suffix of x equal to a prefix of y, and mer- 
ging two fragments results in a single longer fragment. 

Theorem 2. GREEDY reconstructs the original sequence s 
if every repeat is bridged. 

Theorem 2 allows us to determine the coverage depth 
required by GREEDY: we must ensure that all repeats are 
bridged. By the union bound, 
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Figure 5 A subsequence s^ is bridged if and only if there exists at least one read which covers at least one base on both sides of the 
subsequence, i.e. the read arrives in the preceding length L-ZA interval 
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where p^ nbndged is defined in (3) and a m is the num- 
ber of repeats of length m. Setting the right-hand side 
of (5) to g ensures P e rror ^ L and yields the performance 
curve of GREEDY in Figure 2. Note that the repeat sta- 
tistics {a m } are sufficient to compute this curve. 

GREEDY requires L > £ rep eat + h whereas the lower 
bound has its asymptote at L = interleaved + 1- m chromo- 
some 19, for instance, there is a large difference between 
interleaved = 2248 and £ re peat = 4092, and in Figure 2 we see 
a correspondingly large gap. GREEDY is evidently sub- 
optimal in handling interleaved repeats. Its strength, how- 
ever, is that once the reads are slightly longer than i ep eat> 
coverage of the sequence is sufficient for correct recon- 



struction. Thus if t 
close to optimal. 



repeat 



interleaved* then GREEDY is 



/C-mer algorithms 

The greedy algorithm fails when there are unbridged 
repeats, even if there are no unbridged interleaved repeats, 
and therefore requires a read length much longer than 
that required by Ukkonen's condition. As we will see, 
7<T-mer algorithms do not have this limitation. 

Background 

In the introduction we mention Sequencing By Hybridiza- 
tion (SBH), for which Ukkonen's condition was originally 
introduced. In the SBH setting, an optimal algorithm match- 
ing Ukkonens condition is known, due to Pevzner [18]. 

Pevzner's algorithm is based on finding an appropriate 
cycle in a 7<T-mer graph (also known as a de Bruijn graph) 
with K = L - 1 (see e.g. [19] for an overview). A 7<T-mer 
graph is formed by first creating a node in the graph for 
each unique 7<T-mer (length K subsequence) in the set of 
reads, and then adding an edge with overlap K - 1 
between any two nodes representing 7<T-mers that are 
adjacent in a read, i.e. offset by a single nucleotide. Edges 
thus correspond to unique (K + l)-mers in s and paths 



correspond to longer subsequences obtained by merging 
the constituent nodes. There exists a cycle corresponding 
to the original sequence s, and reconstruction entails 
finding this cycle. 

As is common, we will replace edges corresponding to 
an unambiguous path by a single node (c.f. Figure 6). 
Since the subsequences at some nodes are now longer 
than K, this is no longer a 7<T-mer graph, and we call the 
more general graph a sequence graph. The simplified 
graph is called the condensed sequence graph. 

The condensed graph has the useful property that if 
the original sequence s is reconstructible, then s is 
determined by a unique Eulerian cycle: 

Theorem 3. Let G 0 be the K-mer graph constructed 
from the (K + l)-spectrum Sk+i ofs, and let G be the con- 
densed sequence graph obtained from G 0 . If Ukkonen's 
condition is satisfied, i.e. there are no triple or interleaved 
repeats of length at least K, then there is a unique Eulerian 
cycle C in G and C corresponds to s. 

Theorem 3 characterizes, deterministically, the values 
of K for which reconstruction from the (K + ^-spec- 
trum is possible. We proceed with application of the K- 
mer graph approach to shotgun sequencing data. 

Basic /C-mer algorithm 

Starting with Idury and Waterman [16], and then Pevzner 
et al.'s [8] EULER algorithm, most current assembly algo- 
rithms for shotgun sequencing are based on the 7<T-mer 
graph. Idury and Waterman [16] made the key observation 
that SBH with subsequences of length K+l can be emulated 
by shotgun sequencing if each read overlaps the subsequent 
read by K: the set of all (K +l)-mers within the reads is 
equal to the (K+l) -spectrum Sk+i* The resultant algorithm 
DEBRUIJN which consists of constructing the 7<T-mer graph 
from the (K+l) -spectrum observed in the reads, condensing 
the graph, and then identifying an Eulerian cycle, has suffi- 
cient conditions for correct reconstruction as follows. 

Theorem 4. DEBRUIJNtW£/z parameter choice K 
reconstructs the original sequence s if: 

(a) K > ^interleaved 
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(b) K > e triple 

(c) adjacent reads overlap by at least K 

Lander and Waterman's coverage analysis applies also 
to Condition (c) of Theorem 4, yielding a normalized 
coverage depth requirement c = 1/(1 — K/L). The larger 
the overlap K, the higher the coverage depth required. 
Conditions (a) and (b) say that the smallest K one can 

Choose iS K = max{£ trip i e , interleaved} + h SO 
1 

Q — 

max{€ trip i e/ 

£ interleaved } + l * (6) 
L 

The performance of DEBRUIJN is plotted in Figure 2. 
DEBRUIJN significantly improves on GREEDY by obtain- 
ing the correct first order performance: given sufficiently 
many reads, the read length L may be decreased to {triple* 
interleaved) + 1- Still, the number of reads required to 
approach this critical length is far above the lower 
bound. The following subsection pursues reducing K in 
order to reduce the required number of reads. 

Improved /C-mer algorithms 

Algorithm DEBRUIJN ignores a lot of information con- 
tained in the reads, and indeed all of the 7<T-mer based 
algorithms proposed by the sequencing community 
(including [16], [8], [20], [21], [22], [23]) use the read 
information to a greater extent than the naive DEB- 
RUIJN algorithm. Better use of the read information, as 
described below in algorithms SIMPLEBRIDGING and 
MULTIBRIDGING, will allow us to relax the condition 
K >max{£ interleaved , triple} f° r success of DEBRUIJN, 
which in turn reduces the high coverage depth required 
by Condition (c). 

Existing algorithms use read information in a variety of 
distinct ways to resolve repeats. For instance, Pevzner et 
al. [8] observe that for graphs where each edge has multi- 
plicity one, if one copy of a repeat is bridged, the repeat 
can be resolved through what they call a "detachment". 
The algorithm SIMPLEBRIDGING described below is 



very similar, and resolves repeats with two copies if at 
least one copy is bridged. 

Meanwhile, other algorithms are better suited to 
higher edge multiplicities due to higher order repeats; 
IDBA (Iterative DeBruijn Assembler) [17] creates a ser- 
ies of AT-mer graphs, each with larger K> and at each 
step uses not just the reads to identify adjacent 7<T-mers, 
but also all the unbridged paths in the 7<T-mer graph 
with smaller K. Although not stated explicitly in their 
paper, we observe here that if all copies of every repeat 
are bridged, then IDBA correctly reconstructs. 

However, it is suboptimal to require that all copies of 
every repeat up to the maximal K be bridged. We intro- 
duce MULTIBRIDGING, which combines the aforemen- 
tioned ideas to simultaneously allow for single-bridged 
double repeats, triple repeats in which all copies are 
bridged, and unbridged non-interleaved repeats. 

SimpleBridging 

SIMPLEBRIDGING improves on DEBRUIJN by resolving 
bridged 2-repeats (i.e. a repeat with exactly two copies in 
which at least one copy is bridged by a read). Condition 
(a) K > interleaved f° r success of DEBRUIJN (ensuring that 
no interleaved repeats appear in the initial K-mer graph) 
is updated to require only no unbridged interleaved 
repeats, which matches the lower bound. With this 
change, Condition (b) K > triple forms the bottleneck for 
typical DNA sequences. Thus SIMPLEBRIDGING is 
optimal with respect to interleaved repeats, but it is sub- 
optimal with respect to triple repeats. 

SIMPLEBRIDGING deals with repeats by performing 
surgery on certain nodes in the sequence graph. In the 
sequence graph, a repeat corresponds to a node we call 
an X-node, a node with in-degree and out-degree each at 
least two (e.g. Figure 7). A self-loop adds one each to the 
in degree and out-degree. The cycle C(s) traverses each 
X-node at least twice, so X-nodes correspond to repeats 
in s. We call an X-node traversed exactly twice a 2-X- 
node; these nodes correspond to 2-repeats, and are said 
to be bridged if the corresponding repeat in s is bridged. 
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Figure 7 An example of the bridging step in SIMPLEBRIDGING. 
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In the repeat resolution step of SIMPLEBRIDGING 
(illustrated in Figure 7), bridged 2-X-nodes are duplicated 
in the graph and incoming and outgoing edges are inferred 
using the bridging read, reducing possible ambiguity. 

Theorem 5. SIMPLEBRIDGINGw/£/z parameter choice 
K reconstructs the original sequence s if: 

(a) all interleaved repeats are bridged 

(b) K 

^ ^triple 

(c) adjacent reads overlap by at least K. 
By the union bound, 



[some interleaved repeat is unb ridged] 

;E^»(^ nbrid8ed ) 2 (P» nbrid8ed ) 2 



(7) 



where b m>n is the number of interleaved repeats in which 
one repeat is of length m and the other is of length n. To 
ensure that condition (a) in the above theorem fails with 
probability no more than/., the right hand side of (7) is set 
to be/.; this imposes a constraint on the coverage depth. 
Furthermore, conditions (b) and (c) imply that the nor- 
malized coverage depth c > 1/(1 — (^triple + 1)/^). These 
two constraints together yield the performance curve of 
SIMPLEBRIDGING in Figure 2. 

MultiBridging 

We now turn to triple repeats. As previously observed, it 
can be challenging to resolve repeats with more than one 
copy [8], because an edge into the repeat may be paired 
with more than one outgoing edge. As discussed above, 
our approach here shares elements with IDBA [17]: we 
note that increasing the node length serves to resolve 
repeats. Unlike IDBA, we do not increase the node length 
globally. 

As noted in the previous subsection, repeats corre- 
spond to nodes in the sequence graph we call X-nodes. 



Here the converse is false: not all repeats correspond to 
X-nodes. A repeat is said to be all-bridged if all repeat 
copies are bridged, and an X-node is called all-bridged if 
the corresponding repeat is all-bridged. 

The requirement that triple repeats be all bridged allows 
them to be resolved locally (Figure 8). The X-node resolu- 
tion procedure given in Step 4 of MULTIBRIDGING can 
be interpreted in the 7<T-mer graph framework as increas- 
ing K locally so that repeats do not appear in the graph. In 
order to do this, we introduce the following notation for 
extending nodes: Given an edge (v, q) with weight <2 v>q , 
let v^ q denote v extended one base to the right along 
(v, q), i.e. v^ q = vq^ +1 (notation introduced in Sec). 

Similarly, let = p^^v. MULTIBRIDGING is 
described as follows. 

Algorithm 1 MULTIBRIDGING. Input: reads K , 
parameter 7<T. Output: sequence g • 

K-mer steps 1-3: 

1. For each subsequence x of length K in a read, form 
a node with label x. 

2. For each read, add edges between nodes represent- 
ing adjacent 7<T-mers in the read. 

3. Condense the graph (c.f. Figure 6). 

4. Bridging step: (See Figure 8). While there exists a 
bridged X-node v: (i) For each edge (p,-, v) with weight 
a Pi,v, create a new node = Pi ^v and an edge (p if u,-) 
with weight 1 +a p t v. Similarly for each edge (v, q ; ), 
create a new node Wj = and edge (wj , q ; ). (ii) If v 
has a self-loop (v, v) with weight a Y>v , add an edge 
(i/^V^v) with weight a Y}Y + 2. (iii) Remove node v 
and all incident edges, (iv) For each pair w b w ; adjacent 
in a read, add edge (u/,w ; ). If exactly one each of the u,- 
and Wj nodes have no added edge, add the edge, (v) 
Condense graph. 
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Figure 8 MULTIBRIDGING resolves an X-node with label ATTGCAA corresponding to a triple repeat. 



5. Finishing step: Find an Eulerian cycle in the graph 
and return the corresponding sequence. 

Theorem 6. The algorithm MULTIBRIDGINGrecon- 
structs the sequence s if: 

(a) all interleaved repeats are bridged 

(b) all triple repeats are all-bridged 

(c) the sequence is covered by the reads. 

A similar analysis as for SIMPLEBRIDGING yields the 
performance curve of MULTIBRIDGING in Figure 2. 

Gap to lower bound 

The only difference between the sufficient condition 
guaranteeing the success of MULTIBRIDGING and the 



necessary condition of the lower bound is the bridging 
condition of triple repeats: while MULTIBRIDGING 
requires bridging all three copies of the triple repeats, 
the necessary condition requires only bridging a single 
copy. When triple is significantly smaller than interleaved* 
the bridging requirement of interleaved repeats domi- 
nates over that of triple repeats and MULTIBRIDGING 
achieves very close to the lower bound. This occurs in 
hcl9 and the majority of the datasets we looked at. (See 
Figure 9 and the plots in additional file 1.) A critical 
phenomenon occurs as L increases: for L < L crit recon- 
struction is impossible, over a small critical window the 
bridging requirement of interleaved repeats (primarily 
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Figure 9 Simulation results for each of the GAGE reference genomes. Each simulated {N,L) point is marked with the number of correct 
reconstructions (e.g. 93, 98, 95) on 100 simulated read sets. All four algorithms (GREEDY, DEBRUIJN, SIMPLEBRIDGING, and MULTIBRIDGING) were 
run on S. Aureus, R. sphaeroides and hc14. Note that MULTIBRIDGING is very close to the lower bound on all 3 datasets. 



the longest) dominates, and then for larger L, coverage 
suffices. 

On the other hand, when triple is comparable or lar- 
ger than ^interleaved; then MULTIBRIDGING has a gap in 
the coverage depth to the lower bound (see for example 
Figure 3). If we further assume that the longest triple 
repeat is dominant, then this gap can be calculated to 

be a factor of 3 • 1<3g3g - ^3.72 for/. = 10 2 . This gap 
logs -1 

occurs only within the critical window where the 
repeat-bridging constraint is active. Beyond the critical 
window, the coverage constraint dominates and MUL- 
TIBRIDGING is optimal. Further details are provided in 
the appendices. 

Simulations and complexity 

In order to verify performance predictions, we imple- 
mented and ran the algorithms on simulated error-free 
reads from sequenced genomes. For each algorithm, we 
sampled (N, L) points predicted to give <5% error, and 
recorded the number of times correct reconstruction 
was achieved out of 100 trials. Figure 9 shows results 
for the three GAGE reference sequences. 

We now estimate the run-time of MULTIBRIDGING. 
The algorithm has two phases: the 7<T-mer graph forma- 
tion step, and the repeat resolution step. The 7<T-mer 
graph formation runtime can be easily bounded by O 
{(L-K)NK), assuming 0(K) look-up time for each of the 
(L-K)N 7<T-mers observed in reads. This step is common 
to all 7<T-mer graph based algorithms, so previous works 
to decrease the practical runtime or memory require- 
ments are applicable. 

The repeat resolution step depends on the repeat sta- 
tistics and choice of K. It can be loosely bounded as 

max repeats x d x J . The second sum is over 

\ of length I J 



distinct of length maximal repeats x of length t and d x is 
the number of (not necessarily maximal) copies of repeat x. 
The bound comes from the fact that each maximal repeat 
of length K < i < L is resolved via exactly one bridged 
X-node, and each such resolution requires examining at 
most the Ld x distinct reads that contain the repeat. We 

note that / jn ^ / max repeats x d x < L/ at anc [ the 

latter quantity is easily computable from our sufficient 
statistics. 

For our data sets, with appropriate choice of K, the 
bridging step is much simpler than the 7<T-mer graph for- 
mation step: for R. sphaeroides we use K = 40 to get 

EL 
Lap = 412; in contrast, N >22421 for the relevant 
i=K 1 

range of L. Similarly, for hcl4, using K - 300, 

EL 
Lap = 661 while N >733550; for S. Aureus, 
l=K 

E L Lap = 558 while N >8031. 
i=K 1 

Discussions and extensions 

The notion of optimal shotgun assembly is not commonly 
discussed in the literature. One reason is that there is no 
universally agreed-upon metric of success. Another reason 
is that most of the optimization-based formulations of 
assembly have been shown to be NP-hard, including 
Shortest Common Superstring [24], [5], De Bruijn Super- 
walk [8], [25], and Minimum s-Walk on the string graph 
[9], [25]. Thus, it would seem that optimal assembly algo- 
rithms are out of the question from a computational per- 
spective. What we show in this paper is that if the goal is 
complete reconstruction, then one can define a clear 
notion of optimality, and moreover there is a computa- 
tionally efficient assembly algorithm (MULTIBRIDGING) 
that is near optimal for a wide range of DNA repeat statis- 
tics. So while the reconstruction problem may well be NP- 



Bresler et al. BMC Bioinformatics 2013, 14(Suppl 5):S18 
http://www.biomedcentral.eom/1 471 -21 05/1 4/S5/S1 8 



Page 12 of 13 



hard, typical instances of the problem seem much easier 
than the worst-case, a possibility already suggested by 
Nagarajan and Pop [26]. 

The MULTIBRIDGING algorithm is near optimal in 
the sense that, for a wide range of repeat statistics, it 
requires the minimum read length and minimum cover- 
age depth to achieve complete reconstruction. However, 
since the repeat statistics of a genome to be sequenced 
are usually not known in advance, this minimum 
required read length and minimum required coverage 
depth may also not be known in advance. In this con- 
text, it would be useful for the Multi- Bridging algorithm 
to validate whether its assembly is correct. More gener- 
ally, an interesting question is to seek algorithms which 
are not only optimal in their data requirements but also 
provide a measure of confidence in their assemblies. 

How realistic is the goal of complete reconstruction 
given current-day sequencing technologies? The mini- 
mum read lengths L crit required for complete reconstruc- 
tion on the datasets we examined are typically on the 
order of 500-3000 base pairs (bp). This is substantially 
longer than the reads produced by Illumina, the current 
dominant sequencing technology, which produces reads 
of lengths 100-200bp; however, other technologies pro- 
duce longer reads. PacBio reads can be as long as several 
thousand base pairs, and as demonstrated by [27], the 
noise can be cleaned by Illumina reads to enable near 
complete reconstruction. Thus our framework is already 
relevant to some of the current cutting edge technologies. 
To make our framework more relevant to short-read 
technologies such as Illumina, an important direction is 
to incorporate mate-pairs in the read model, which can 
help to resolve long repeats with short reads. Other 
extensions to the basic shotgun sequencing model: het- 
erogenous read lengths: This occurs in some technolo- 
gies where the read length is random (e.g. Pacbio) or 
when reads from multiple technologies are used. Gener- 
alized Ukkonens conditions and the sufficient conditions 
of MULTIBRIDGING extend verbatim to this case, and 
only the computation of the bridging probability (3) has 
to be slightly modified. 

non-uniform read coverage: Again, only the compu- 
tation of the bridging probability has to be modified. 
One issue of interest is to investigate whether reads are 
sampled less frequently from long repeat regions. If so, 
our framework can quantify the performance hit. 

double strand: DNA is double-stranded and consists 
of a length-G sequence u and its reverse complement 
u . Each read is either sampled from u or u . This more 
realistic scenario can be mapped into our single-strand 
model by defining s as the length-2G concatenation of u 
and u, transforming each read into itself and its reverse 
complement so that there are 2N reads. Generalized 
Ukkonen's conditions hold verbatim for this problem, 



and MULTIBRIDGING can be applied, with the slight 
modification that instead of looking for a single Eulerian 
path, it should look for two Eulerian paths, one for each 
component of the sequence graph after repeat-resolu- 
tion. An interesting aspect of this model is that, in addi- 
tion to interleaved repeats on the single strand u, 
reverse complement repeats on u will also induce inter- 
leaved repeats on the sequence s. 

Additional material 



Additional file 1: In this supplementary material, we display in Figures 
10-17 the output of our pipeline for 9 datasets (in addition to hc19, 
whose output is in the introduction, and the GAGE datasets R. 
sphaeroides, S. Aureus, and hc14). For each dataset we plot 
log(l + CLi), the log of one plus the number of repeats of each 
length L From the repeat statistics a mi b m/n , and c mi we produce a 
feasibility plot. The thick black line denotes the lower bound on feasible 
N, L, and the green line is the performance achieved by MULTIBRIDGING. 
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